Case Study: Police Stops in Oakland


1. Formulate your question

Reflexivity

  1. Whether Black people in Oakland might be more likely to be stopped than White people
  2. Whether Black people who are stopped might be more likely to have contraband
  1. What do I already know about this subject?
  2. Why am I studying this?
  3. What do I expect or hope to find/learn, and why?
  4. Who is affected by this topic, and how am I connected to them?

(Adapted from Tanweer et al. (2021), 14-15, and Liboiron (2021))

Set up our workspace

Packages

library(tidyverse)   # for working with the data
library(lubridate)   # for working with datetime data

library(skimr)       # generate a text-based overview of the data
library(visdat)      # generate plots visualizing data types and missingness
library(plotly)      # quickly create interactive plots

Get the Data

data_dir = 'data'
target_file = file.path(data_dir, 'oakland.zip')

if (!dir.exists(data_dir)) {
    dir.create(data_dir)
}
if (!file.exists(target_file)) {
    download.file('https://stacks.stanford.edu/file/druid:yg821jf8611/yg821jf8611_ca_oakland_2020_04_01.csv.zip', 
                  target_file)
}

2. Read in your data

The dataset is a zipped csv or comma-separated value file. CSVs are structured like Excel spreadsheets, but are stored in plain text rather than Excel’s format.

dataf = read_csv(target_file)
## Rows: 133407 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (16): raw_row_number, location, beat, subject_race, subject_sex, officer_assignment, type, ...
## dbl   (3): lat, lng, subject_age
## lgl   (7): arrest_made, citation_issued, warning_issued, contraband_found, contraband_drugs, con...
## date  (1): date
## time  (1): time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3. Check the packaging

Peng and Matsui (2016) use some base R functions to look at dimensions of the dataframe and column (variable) types. skimr is more powerful.

## May take a couple of seconds
skim(dataf)
## ── Data Summary ────────────────────────
##                            Values
## Name                       dataf 
## Number of rows             133407
## Number of columns          28    
## _______________________          
## Column type frequency:           
##   character                16    
##   Date                     1     
##   difftime                 1     
##   logical                  7     
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##    skim_variable                 n_missing complete_rate min max empty n_unique whitespace
##  1 raw_row_number                        0        1        1  71     0   133407          0
##  2 location                             51        1.00     1  78     0    60723          0
##  3 beat                              72424        0.457    3  19     0      129          0
##  4 subject_race                          0        1        5  22     0        5          0
##  5 subject_sex                          90        0.999    4   6     0        2          0
##  6 officer_assignment               121431        0.0898   5  97     0       20          0
##  7 type                              20066        0.850    9  10     0        2          0
##  8 outcome                           34107        0.744    6   8     0        3          0
##  9 search_basis                      92250        0.309    5  14     0        3          0
## 10 reason_for_stop                       0        1       14 197     0      113          0
## 11 use_of_force_description         116734        0.125   10  10     0        1          0
## 12 raw_subject_sdrace                    0        1        1   1     0        7          0
## 13 raw_subject_resultofencounter         0        1        7 213     0      315          0
## 14 raw_subject_searchconducted           0        1        2  24     0       34          0
## 15 raw_subject_typeofsearch          52186        0.609    2 112     0      417          0
## 16 raw_subject_resultofsearch       111633        0.163    5  95     0      298          0
## 
## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min        max        median     n_unique
## 1 date                  2          1.00 2013-04-01 2017-12-31 2015-07-19     1638
## 
## ── Variable type: difftime ─────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min    max        median n_unique
## 1 time                  2          1.00 0 secs 86340 secs 16:12      1439
## 
## ── Variable type: logical ──────────────────────────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate   mean count                  
## 1 arrest_made                0         1     0.121  FAL: 117217, TRU: 16190
## 2 citation_issued            0         1     0.394  FAL: 80836, TRU: 52571 
## 3 warning_issued             0         1     0.231  FAL: 102545, TRU: 30862
## 4 contraband_found       92250         0.309 0.149  FAL: 35005, TRU: 6152  
## 5 contraband_drugs       92250         0.309 0.0844 FAL: 37684, TRU: 3473  
## 6 contraband_weapons     92250         0.309 0.0299 FAL: 39928, TRU: 1229  
## 7 search_conducted           0         1     0.309  FAL: 92250, TRU: 41157 
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   mean      sd     p0    p25    p50    p75   p100 hist 
## 1 lat                 114         0.999   37.8  0.0284   37.4   37.8   37.8   37.8   38.1 ▁▁▇▁▁
## 2 lng                 114         0.999 -122.   0.0432 -122.  -122.  -122.  -122.  -119.  ▇▁▁▁▁
## 3 subject_age      102724         0.230   33.2 13.3      10     23     29     41     97   ▇▆▃▁▁

For our motivating questions

Missing values


But this raises a warning about large data

vis_miss(dataf)

So we’ll use sample_n() to draw a subset

set.seed(2021-09-28)
dataf_smol = sample_n(dataf, 1000)

vis_miss(dataf_smol)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Arguments in vis_miss() are useful for picking up patterns in missing values

## cluster = TRUE uses hierarchical clustering to order the rows
vis_miss(dataf_smol, cluster = TRUE) +
    coord_flip()

Several variables related to search outcomes are missing together

contraband_found, contraband_drugs, contraband_weapons, search_basis, use_of_force_description, raw_subject_typeofsearch, and raw_subject_resultofsearch

However, search_conducted is complete

A critical question

When a search has been conducted, do we know whether contraband was found?

dataf  |>  
    filter(search_conducted) |>
    count(search_conducted, is.na(contraband_found))

4. Look at the top and the bottom of your data

With 28 columns, the dataframe is too wide to print in a readable way.

Instead we’ll use the base R function View() in an interactive session. This shows us an Excel-like spreadsheet presentation of a dataframe.

View() can cause significant problems if you use it with a large dataframe on a slower machine. So we’ll use a pipe: first extract the head() or tail() of the dataset, and then View() it. We’ll also go ahead and view dataf_smol, the subset we created for visdat above.

dataf |> 
    head() |> 
    View()

dataf |> 
    tail() |> 
    View()

View(dataf_smol)

Some of my observations:

We can also use skimr to check data quality by looking at the minimum and maximum values. Do these ranges make sense for what we expect the variable to be?

skim(dataf)
## ── Data Summary ────────────────────────
##                            Values
## Name                       dataf 
## Number of rows             133407
## Number of columns          28    
## _______________________          
## Column type frequency:           
##   character                16    
##   Date                     1     
##   difftime                 1     
##   logical                  7     
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##    skim_variable                 n_missing complete_rate min max empty n_unique whitespace
##  1 raw_row_number                        0        1        1  71     0   133407          0
##  2 location                             51        1.00     1  78     0    60723          0
##  3 beat                              72424        0.457    3  19     0      129          0
##  4 subject_race                          0        1        5  22     0        5          0
##  5 subject_sex                          90        0.999    4   6     0        2          0
##  6 officer_assignment               121431        0.0898   5  97     0       20          0
##  7 type                              20066        0.850    9  10     0        2          0
##  8 outcome                           34107        0.744    6   8     0        3          0
##  9 search_basis                      92250        0.309    5  14     0        3          0
## 10 reason_for_stop                       0        1       14 197     0      113          0
## 11 use_of_force_description         116734        0.125   10  10     0        1          0
## 12 raw_subject_sdrace                    0        1        1   1     0        7          0
## 13 raw_subject_resultofencounter         0        1        7 213     0      315          0
## 14 raw_subject_searchconducted           0        1        2  24     0       34          0
## 15 raw_subject_typeofsearch          52186        0.609    2 112     0      417          0
## 16 raw_subject_resultofsearch       111633        0.163    5  95     0      298          0
## 
## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min        max        median     n_unique
## 1 date                  2          1.00 2013-04-01 2017-12-31 2015-07-19     1638
## 
## ── Variable type: difftime ─────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min    max        median n_unique
## 1 time                  2          1.00 0 secs 86340 secs 16:12      1439
## 
## ── Variable type: logical ──────────────────────────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate   mean count                  
## 1 arrest_made                0         1     0.121  FAL: 117217, TRU: 16190
## 2 citation_issued            0         1     0.394  FAL: 80836, TRU: 52571 
## 3 warning_issued             0         1     0.231  FAL: 102545, TRU: 30862
## 4 contraband_found       92250         0.309 0.149  FAL: 35005, TRU: 6152  
## 5 contraband_drugs       92250         0.309 0.0844 FAL: 37684, TRU: 3473  
## 6 contraband_weapons     92250         0.309 0.0299 FAL: 39928, TRU: 1229  
## 7 search_conducted           0         1     0.309  FAL: 92250, TRU: 41157 
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   mean      sd     p0    p25    p50    p75   p100 hist 
## 1 lat                 114         0.999   37.8  0.0284   37.4   37.8   37.8   37.8   38.1 ▁▁▇▁▁
## 2 lng                 114         0.999 -122.   0.0432 -122.  -122.  -122.  -122.  -119.  ▇▁▁▁▁
## 3 subject_age      102724         0.230   33.2 13.3      10     23     29     41     97   ▇▆▃▁▁

More observations:

5. Check your Ns (and) 6. Validate with at least one external data source

Screenshot of the two summary tables from the Oakland report. Source: https://cao-94612.s3.amazonaws.com/documents/OPD-Racial-Impact-Report-2016-2018-Final-16Apr19.pdf, page 8

From dates to years

dataf |> 
    mutate(year = year(date)) |> 
    filter(year %in% c(2016, 2017)) |> 
    count(year)
dataf |> 
    mutate(year = year(date)) |> 
    filter(year %in% c(2016, 2017)) |> 
    count(year, subject_sex)
## Men in 2016 in the report vs. our data: 8.2%
(24576 - 22563) / 24576
## [1] 0.08190918
## Women in 2016 in the report vs. our data: 3.6%
(7965 - 7677) / 7965
## [1] 0.03615819
## All of 2016 in the report vs. our data: 7.1%
(32569 - 30268) / 32569
## [1] 0.07065

7. Make a plot

Peng and Matsui note that plots are useful for both checking and setting expectations

dataf |> 
    mutate(year = year(date)) |> 
    ggplot(aes(year)) +
    geom_bar()
## Warning: Removed 2 rows containing non-finite values (stat_count).

How about counts per year by race/ethnicity?

dataf |> 
    mutate(year = year(date)) |> 
    ggplot(aes(year, fill = subject_race)) +
    geom_bar()
## Warning: Removed 2 rows containing non-finite values (stat_count).

Let’s switch from bars to points and lines and change up the color palette

dataf |> 
    mutate(year = year(date)) |> 
    ggplot(aes(year, color = subject_race)) +
    geom_point(stat = 'count') +
    geom_line(stat = 'count') +
    scale_color_brewer(palette = 'Set1')
## Warning: Removed 2 rows containing non-finite values (stat_count).
## Removed 2 rows containing non-finite values (stat_count).

plotly::ggplotly() creates an interactive version of a ggplot object

ggplotly()
## Warning: Removed 2 rows containing non-finite values (stat_count).
## Removed 2 rows containing non-finite values (stat_count).

8. Try the easy solution first

Let’s translate our natural-language research questions into statistical questions:

  1. Whether Black people in Oakland might be more likely to be stopped than White people \[ \Pr(stopped | Black) \textrm{ vs } \Pr(stopped | White) \]

  2. Whether Black people who are stopped might be more likely to have contraband \[ \Pr(contraband | stopped, searched, Black) \textrm{ vs } \Pr(contraband | stopped, searched, White) \]

Mathematical aside

\[ \frac{\Pr(stopped | Black)}{\Pr(stopped | White)} = \frac{\Pr(Black|stopped)}{\Pr(Black)} \frac{\Pr(White)}{\Pr(White|stopped)} \]

Stops, by race

\[ \Pr(Black|stopped) \]

dataf |> 
    count(subject_race) |> 
    mutate(share = n / sum(n)) |> 
    arrange(desc(share)) |> 
    mutate(share = scales::percent(share, accuracy = 1))
r/e residents stops ratio
Black 24% 59% 2.5
Hispanic 27% 20% 0.7
White 27% 12% 0.4
API 16% 6% 0.4

Searches, by race

## What fraction of stops had a search? 
dataf |> 
    count(search_conducted) |> 
    mutate(share = n / sum(n))

Across all subjects, 31% of stops involved a search.

ggplot(dataf, aes(subject_race, fill = search_conducted)) +
    geom_bar(position = position_fill()) +
    scale_fill_manual(values = c('transparent', 'red'))

dataf |> 
    count(subject_race, search_conducted) |> 
    group_by(subject_race) |> 
    mutate(rate = n / sum(n)) |> 
    ungroup() |> 
    filter(search_conducted) |> 
    mutate(rate = scales::percent(rate, accuracy = 1))

Contraband finds, by race

\[ \Pr(contraband | searched, stopped, Black) \] We want to filter() down to only stops where a search was conducted

dataf |> 
    filter(search_conducted) |> 
    ggplot(aes(subject_race, fill = contraband_found)) +
    geom_bar(position = position_fill()) +
    scale_fill_manual(values = c('transparent', 'blue')) +
    ylim(0, .2)
## Warning: Removed 5 rows containing missing values (geom_bar).

dataf |> 
    filter(search_conducted) |> 
    count(subject_race, contraband_found) |> 
    group_by(subject_race) |> 
    mutate(rate = n / sum(n)) |> 
    ungroup() |> 
    filter(contraband_found) |> 
    mutate(rate = scales::percent(rate, accuracy = 1))

Results

This preliminary analysis indicates that

9. Follow up

What are some further directions we could take this analysis?

Discussion questions

  1. Suppose you’ve conducted this EDA because you’re working with an activist organization that promotes defunding the police and prison abolition. Should you share the preliminary findings above with your organization contacts?

  2. What influence would the following factors make to your answer?

    • Funding: Whether you’re being paid as a consultant vs. volunteering your expertise
    • Values: Your own views about policing and prisons
    • Relationships: Whether you are friends with members of the activist organization and/or police
    • Communications: The degree to which you can control whether and how the organization will publish your preliminary findings
    • Timeliness: Whether these findings are relevant to a pending law or policy change
  3. What other factors should be taken into account as you decide whether to share your findings? Or not taken into account?

  4. How has this “raw data” been shaped by the journey of the data to get to us?

Lab

The lab related to this material is available at https://github.com/data-science-methods/lab-w06-eda.

References

Huebner, Marianne, Werner Vach, and Saskia le Cessie. 2016. “A Systematic Approach to Initial Data Analysis Is Good Research Practice.” The Journal of Thoracic and Cardiovascular Surgery 151 (1): 25–27. https://doi.org/10.1016/j.jtcvs.2015.09.085.
Liboiron, Max. 2021. Pollution Is Colonialism. Duke University Press. https://books.google.com?id=NL4lEAAAQBAJ.
Peng, Roger D., and Elizabeth Matsui. 2016. The Art of Data Science: A Guide for Anyone Who Works with Data. Leanpub. http://leanpub.com/artofdatascience.
Pierson, Emma, Camelia Simoiu, Jan Overgoor, Sam Corbett-Davies, Daniel Jenson, Amy Shoemaker, Vignesh Ramachandran, et al. 2020. “A Large-Scale Analysis of Racial Disparities in Police Stops Across the United States.” Nature Human Behaviour, May, 1–10. https://doi.org/10.1038/s41562-020-0858-1.
Tanweer, Anissa, Emily Kalah Gade, P. M. Krafft, and Sarah K. Dreier. 2021. “Why the Data Revolution Needs Qualitative Thinking.” Harvard Data Science Review, July. https://doi.org/10.1162/99608f92.eee0b0da.